IBM HR Analytics Employee Attrition and Performance Dataset

In this study, we analyze HR data available from kaggle.com. The data is fictional and was created by IBM data scientists.

Standardized Dataset

Problem Description

In the dataset, Attrition indicates whether an employee has left the company (churned) or not. We would like to build a predictive model for this target variable.

X and y sets

Correlation of the features

Training and testing sets

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
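The effect of stratification can be checked on a small example. The following is a minimal sketch, assuming synthetic labels with a 20% positive rate (not the actual Attrition column) and a placeholder feature matrix; with these numbers every test fold keeps the same class ratio as the full set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels (assumption): 80 negatives (0) and 20 positives (1).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_rates = []
for train_idx, test_idx in skf.split(X, y):
    # Share of the positive class in this test fold.
    fold_rates.append(float(y[test_idx].mean()))

print("Positive rate in full set:", y.mean())
print("Positive rate per test fold:", fold_rates)
```

Because 80 and 20 are both divisible by 5, each test fold here contains exactly 16 negatives and 4 positives; with real data the per-fold ratios are only approximately equal.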

Modeling: Gaussian Naïve Bayes

In this article, we use scikit-learn's GaussianNB class, which implements the Gaussian Naive Bayes algorithm for classification. The likelihood of each feature is assumed to be Gaussian: \begin{align} P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right) \end{align} The parameters $\sigma_y$ and $\mu_y$ are estimated using maximum likelihood.
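To make the likelihood formula concrete, here is a small sketch in plain Python. The helper names (`gaussian_likelihood`, `fit_class_params`) and the feature values are hypothetical and chosen only for illustration; the maximum-likelihood estimates are the sample mean and the biased (1/n) variance of a feature within one class.

```python
import math

def gaussian_likelihood(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_class_params(values):
    """Maximum-likelihood estimates: sample mean and biased (1/n) variance."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var

# Hypothetical values of a single feature for the samples of one class y.
feature_values_for_class = [2.0, 4.0, 4.0, 6.0]

mu_y, var_y = fit_class_params(feature_values_for_class)
p = gaussian_likelihood(4.0, mu_y, var_y)

print("mu_y =", mu_y, "sigma_y^2 =", var_y)
print("P(x_i = 4 | y) =", p)
```

Note that GaussianNB additionally adds a small smoothing term to the variances (controlled by its `var_smoothing` parameter), so its fitted variances can differ slightly from the plain estimate above.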

Some of the metrics that we use here to measure model performance: \begin{align} \text{Confusion Matrix} = \begin{bmatrix}T_p & F_p\\ F_n & T_n\end{bmatrix}, \end{align}

where $T_p$, $T_n$, $F_p$, and $F_n$ represent true positive, true negative, false positive, and false negative, respectively.

\begin{align} \text{Precision} &= \frac{T_{p}}{T_{p} + F_{p}},\\ \text{Recall} &= \frac{T_{p}}{T_{p} + F_{n}},\\ \text{F1} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},\\ \text{Balanced-Accuracy (bACC)} &= \frac{1}{2}\left( \frac{T_{p}}{T_{p} + F_{n}} + \frac{T_{n}}{T_{n} + F_{p}}\right). \end{align}

Accuracy can be a misleading metric for imbalanced data sets. In these cases, balanced accuracy (bACC) [6] is recommended: it normalizes the true positive and true negative predictions by the number of positive and negative samples, respectively, and averages the two resulting rates.
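The formulas above can be worked through with a short sketch. The confusion-matrix counts below are hypothetical (they are not results from this dataset) and were chosen to mimic an imbalanced problem, where plain accuracy looks better than balanced accuracy.

```python
# Hypothetical confusion-matrix counts for an imbalanced problem (assumption).
tp, fp, fn, tn = 30, 10, 20, 140

precision = tp / (tp + fp)
recall = tp / (tp + fn)              # true positive rate
specificity = tn / (tn + fp)         # true negative rate
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = 0.5 * (recall + specificity)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
print(f"bACC:      {balanced_accuracy:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```

With these counts the accuracy is 0.85 while the balanced accuracy is about 0.767, because the large number of true negatives dominates the plain accuracy.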

Gaussian Naïve Bayes with Default Parameters

Gaussian Naïve Bayes with the Best Parameters

In order to find the best parameters for our model, we can use RandomizedSearchCV. Here, we have defined a function Best_Parm to find the best parameters.
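A randomized search of this kind might look as follows. This is only a sketch, not the Best_Parm function itself: it assumes synthetic data from `make_classification` instead of the HR dataset, searches only over GaussianNB's `var_smoothing` parameter, and uses balanced accuracy as the (assumed) scoring metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the HR data (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Candidate values for the only tunable GaussianNB parameter used here.
param_distributions = {"var_smoothing": np.logspace(-12, 0, 50)}

search = RandomizedSearchCV(
    estimator=GaussianNB(),
    param_distributions=param_distributions,
    n_iter=20,                       # number of sampled candidates
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_model = search.best_estimator_  # refit on all of X, y with best_params

print("Best parameters:", best_params)
print("Best cross-validated bACC:", search.best_score_)
```

Since GaussianNB has very few hyperparameters, a search like this often changes the scores only marginally compared with the default settings.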

Having identified the best parameters for our model, we train another model using these parameters.

Conclusions

As can be seen, tuning the parameters did not improve the performance metrics compared with the default parameters.


References

  1. Kaggle Dataset: IBM HR Analytics Employee Attrition & Performance
  2. scikit-learn: classifiers
  3. scikit-learn: Metrics and scoring: quantifying the quality of predictions
  4. Confusion matrix
  5. Naive Bayes classifier wiki
  6. Mower, Jeffrey P. "PREP-Mt: predictive RNA editor for plant mitochondrial genes." BMC bioinformatics 6.1 (2005): 1-15.